**CUNEF**

Practice III - Retail score analysis

Machine Learning</strong>

Pablo Mazariegos Reviriego - pablo.mazariegos@cunef.edu

Mario Sabater Pascual - mario.sabater@cunef.edu

Highway

In this Machine Learning practice we will be working with the Yelp dataset. The whole practice will be composed by the following notebooks:

  1. Data Reading and Problem statement
  2. Data selection & variables preparation
  3. **EDA**
  4. Model selection
  5. Best Model and Explainability
  6. Variables Relationship (Graphs)

In this notebook we will perform an Exploratory Data Analysis on the restaurants dataset we exported in the Data selection & variables preparation notebook. We will look at the variables correlation, null, non and nan analysis of the variables and if needed we will drop those that will not work for our future models.

  • Libraries import
  • Functions used in this notebook
  • EDA
    • Data reading & average income
    • Observation of target variable
    • Missing values
    • Outliers study
    • Variables correlation
    • Plotting variables vs the target variable
  • Variables Encoding & Normalization
    • Encoding
    • Normalization
  • Final data exportation

**Libraries import**¶

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
import plotly.express as px
import requests
import scipy.stats as ss
from sklearn.preprocessing import MinMaxScaler
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', 500)

**Functions used in this notebook**¶

In [2]:
def get_postal_code(lat, lng):
    # Replace YOUR_API_KEY with your actual API key
    api_key = 'AIzaSyDg-H2ltcfpG5kJ8WLlzCTuNLNuoR4y8us'
    endpoint = 'https://maps.googleapis.com/maps/api/geocode/json'
    params = {
        'latlng': f'{lat},{lng}',
        'key': api_key,
    }
    response = requests.get(endpoint, params=params)
    data = response.json()
    if data['status'] == 'OK':
        results = data['results']
        if results:
            address = results[0]['formatted_address']
            for component in results[0]['address_components']:
                if 'postal_code' in component['types']:
                    postal_code = component['long_name']
                    return postal_code
    return None

As we have identified that there are some missing values in postal_code and not in latitude and longitude, we will use the geocode API from google maps to identify the postal code of such location.

In [3]:
def get_corr_matrix(dataset=None, metodo='spearman', size_figure=[10,8]):
    # Para obtener la correlación de Spearman, sólo cambiar el metodo por 'spearman'

    if dataset is None:
        print(u'\nHace falta pasar argumentos a la función')
        return 1
    sns.set(style="white")
    # Compute the correlation matrix
    corr = dataset.corr(method=metodo)
    # Set up the matplotlib figure
    f, ax = plt.subplots(figsize=size_figure)
    # Draw the heatmap with the mask and correct aspect ratio
    sns.heatmap(corr, #center=0,
                square=True, linewidths=.5,  cmap='coolwarm', vmin=-1, vmax=1) #cbar_kws={"shrink": .5}
    plt.show()

During the EDA, one of the main focuses is the correlation beween variables. The function get_corr_matrix allows us to do this by computing the correlation matrix of the input dataframe using the specified method, and it sets the self-correlation to zero, and plots the resulting matrix as a heatmap using the seaborn library.

The funtion also has the method for the correlation as spearman and the function displays the plot using matplotlib.pyplot.show()

In [4]:
def cramers_v(var1,var2):
    """ 
    calculate Cramers V statistic for categorial-categorial association.
    uses correction from Bergsma and Wicher,
    Journal of the Korean Statistical Society 42 (2013): 323-328
    
    confusion_matrix: tabla creada con pd.crosstab()
    
    """
    crosstab =np.array(pd.crosstab(var1,var2, rownames=None, colnames=None))
    chi2 = ss.chi2_contingency(crosstab)[0]
    n = crosstab.sum()
    phi2 = chi2 / n
    r, k = crosstab.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    return np.sqrt(phi2corr / min((kcorr-1),(rcorr-1)))

The previous function facilitates us to plot the spearman and pearson correlation matrix. The cramers_v function rather than plotting creates the correlation matrix using the cramers v method.

In [5]:
def plot_feature(df, col_name, isContinuous, target):
    """
    Visualize a variable with and without faceting on the loan status.
    - df dataframe
    - col_name is the variable name in the dataframe
    - full_name is the full variable name
    - continuous is True if the variable is continuous, False otherwise
    """
    
    f, (ax1, ax2) = plt.subplots(nrows = 1, ncols = 2, figsize = (12,3), dpi = 90)
    
    count_null = df[col_name].isnull().sum()
    if isContinuous:
        
        sns.histplot(df.loc[df[col_name].notnull(), col_name], kde = False, ax = ax1)
    else:
        sns.countplot(x=df[col_name], order=None,
                      color = '#5975A4', saturation = 1, ax = ax1)
    ax1.set_xlabel(col_name)
    ax1.set_ylabel('Count')
    ax1.set_title(col_name+ ' Numero de nulos: ' + str(count_null))
    plt.xticks(rotation = 90)


    if isContinuous:
        sns.boxplot(x = col_name, y = target, data = df, ax = ax2)
        ax2.set_ylabel('')
        ax2.set_title(col_name + ' by '+ target)
    else:
        data = df.groupby(col_name)[target].value_counts(normalize=True).to_frame('proportion').reset_index() 
        data.columns = [i, target, 'proportion']
        sns.barplot(x = col_name, y = 'proportion', hue = target, data = data, saturation = 1, ax = ax2)
        ax2.set_ylabel(target + ' fraction')
        ax2.set_title(target)
        plt.xticks(rotation = 90)
    ax2.set_xlabel(col_name)
    
    plt.tight_layout()

Being a supervised machine learning problem, the plot_feature allows us to look at the rest of the variables depending on the target variable value. The function creates a figure with two subplots.

The first subplot is a histogram or countplot of the values in the specified column, depending on whether the column is continuous or categorical.

The second subplot is either a boxplot or barplot showing the relationship between the values in the specified column and the target variable.

The function also displays the number of null values in the column in the title of the first subplot. Finally, the function rotates the x-axis labels for both subplots by 90 degrees and adjusts the layout of the figure using plt.tight_layout().

**EDA**¶

The data we will load for our Exploratory Data Analysis is the csv generated from the R Studio file. This is due to, such csv file contains the information selected in the previous notebook (1.Data selection & variables preparation) as well as the average income by zipcode in 2018 from The USCB (United States Census Bureau) though an API.

Also as we saw in the first notebook, the reviews were made from 2004 to 2018, hence the 2018 average income data is representative.

Data Reading and average income¶

In [6]:
pd.set_option('display.max_columns', 500)
restaurants = pd.read_csv('../data/processed/restaurants.csv', dtype={'postal_code': str})
restaurants.head(3)
Out[6]:
business_id name address city state postal_code latitude longitude stars review_count is_open restaurants food bars sandwiches american (traditional) pizza coffee & tea breakfast & brunch american (new) fast food burgers mexican italian specialty food seafood desserts bakeries salad chinese cafes chicken wings ice cream & frozen yogurt beer sports bars cocktail bars juice bars & smoothies barbeque japanese pubs sushi bars local flavor asian fusion diners steakhouses mediterranean wine bars southern cajun/creole donuts tacos soup thai beer bar tex-mex greek breweries vietnamese chicken shop hot dogs indian bagels comfort food cheesesteaks ethnic food caribbean middle eastern french soul food buffets gastropubs fruits & veggies korean bubble tea tapas/small plates food stands cupcakes noodles wineries cuban spanish acai bowls RestaurantsDelivery OutdoorSeating BusinessAcceptsCreditCards BikeParking RestaurantsPriceRange2 RestaurantsTakeOut ByAppointmentOnly WiFi Alcohol Caters WheelchairAccessible RestaurantsReservations CoatCheck DogsAllowed RestaurantsTableService HasTV HappyHour DriveThru BusinessAcceptsBitcoin Smoking BYOBCorkage AcceptsInsurance RestaurantsCounterService AgesAllowed Premium_Parking Non_Premium_Parking dairy-free gluten-free vegan kosher halal soy-free vegetarian num_days_open num_hours_open
0 MTSW4McQd7CbVtyjqoe9mw St Honore Pastries 935 Race St Philadelphia PA 19107 39.955505 -75.155564 1 80 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 False False False True 1 True False free False True unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown False True False False False False False False False 7.0 94.0
1 mWMc6_wTdE0EUBKIGXDVfA Perkiomen Valley Brewery 101 Walnut St Green Lane PA 18054 40.338183 -75.471659 1 13 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 unknown unknown True True 2 True unknown unknown unknown False True unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown False True False False False False False False False 5.0 40.0
2 CF33F8-E6oudUQ46HnavjQ Sonic Drive-In 615 S Main St Ashland City TN 37015 36.269593 -87.058943 0 6 1 1 1 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 True True True False 1 True False False False False True False False False False True False True unknown unknown unknown unknown unknown unknown False False False False False False False False False 6.0 66.0

Average_income

We obtained the average income data set through the API from https://www.census.gov/ An official website of the United States government.

The download of this data is done through R. The procedure is attached in the path: '../Rcode/avgincome.R'.

the data set "ZipCode_income" is composed of two columns, GEOID with the zip code of each state of the country and average_income which, as the name of the column says, is the average income per zip code.

In [7]:
income = pd.read_csv('../data/raw/income.csv', index_col=0)
income.head()
Out[7]:
GEOID average_income
1 35004 62902.0
2 35005 50473.0
3 35006 55909.0
4 35007 73587.0
5 35010 41954.0

We join the data set with a left join (so that it only joins the data where there is a restaurant).Then we drop the GEOID column.

In [8]:
income['GEOID'] = income['GEOID'].astype(str)
restaurants_income = pd.merge(restaurants, income, left_on='postal_code', right_on='GEOID', how='left')
restaurants_income = restaurants_income.drop('GEOID', axis=1)
restaurants_income.head()
Out[8]:
business_id name address city state postal_code latitude longitude stars review_count is_open restaurants food bars sandwiches american (traditional) pizza coffee & tea breakfast & brunch american (new) fast food burgers mexican italian specialty food seafood desserts bakeries salad chinese cafes chicken wings ice cream & frozen yogurt beer sports bars cocktail bars juice bars & smoothies barbeque japanese pubs sushi bars local flavor asian fusion diners steakhouses mediterranean wine bars southern cajun/creole donuts tacos soup thai beer bar tex-mex greek breweries vietnamese chicken shop hot dogs indian bagels comfort food cheesesteaks ethnic food caribbean middle eastern french soul food buffets gastropubs fruits & veggies korean bubble tea tapas/small plates food stands cupcakes noodles wineries cuban spanish acai bowls RestaurantsDelivery OutdoorSeating BusinessAcceptsCreditCards BikeParking RestaurantsPriceRange2 RestaurantsTakeOut ByAppointmentOnly WiFi Alcohol Caters WheelchairAccessible RestaurantsReservations CoatCheck DogsAllowed RestaurantsTableService HasTV HappyHour DriveThru BusinessAcceptsBitcoin Smoking BYOBCorkage AcceptsInsurance RestaurantsCounterService AgesAllowed Premium_Parking Non_Premium_Parking dairy-free gluten-free vegan kosher halal soy-free vegetarian num_days_open num_hours_open average_income
0 MTSW4McQd7CbVtyjqoe9mw St Honore Pastries 935 Race St Philadelphia PA 19107 39.955505 -75.155564 1 80 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 False False False True 1 True False free False True unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown False True False False False False False False False 7.0 94.0 57735.0
1 mWMc6_wTdE0EUBKIGXDVfA Perkiomen Valley Brewery 101 Walnut St Green Lane PA 18054 40.338183 -75.471659 1 13 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 unknown unknown True True 2 True unknown unknown unknown False True unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown False True False False False False False False False 5.0 40.0 71196.0
2 CF33F8-E6oudUQ46HnavjQ Sonic Drive-In 615 S Main St Ashland City TN 37015 36.269593 -87.058943 0 6 1 1 1 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 True True True False 1 True False False False False True False False False False True False True unknown unknown unknown unknown unknown unknown False False False False False False False False False 6.0 66.0 54097.0
3 k0hlBqXX-Bt0vf1op7Jr1w Tsevi's Pub And Grill 8025 Mackenzie Rd Affton MO 63123 38.565165 -90.321087 0 19 0 1 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 False True True unknown 1 True unknown free full_bar True unknown False unknown unknown unknown True unknown unknown unknown unknown unknown unknown unknown unknown False True False False False False False False False NaN NaN 58867.0
4 bBDDEgkFA1Otx9Lfe7BZUQ Sonic Drive-In 2312 Dickerson Pike Nashville TN 37207 36.208102 -86.768170 0 10 1 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 True True True unknown 1 True False False False False True False False False False True False True unknown unknown unknown unknown unknown unknown False False False False False False False False False 6.0 76.0 42973.0

Duplicates observation

In [9]:
restaurants_income. duplicated().sum()
Out[9]:
0

The absence of duplicates leave us with a the full length of observations as we had at the beginning, at least for the moment.

Setting business_id as index

Also since the business_id is a unique value for each business, we check if the number of unique values are the same as the length of the dataframe.

In [10]:
len(restaurants_income['business_id'].unique()) - len(restaurants_income)
Out[10]:
0
In [11]:
restaurants_income.set_index('business_id', inplace=True)

By setting business_id as index we have it as identifier without loosing any data.

The dataset does not contain duplicated data.

Observation of target variable¶

In [12]:
plot_df = restaurants_income['stars']\
        .value_counts(normalize = True)\
        .mul(100).rename('percent').reset_index()

plot_df_conteo = restaurants_income['stars'].value_counts().reset_index()
plot_df_pc = pd.merge(plot_df, plot_df_conteo, 
on = ['index'], how = 'inner')
plot_df_pc 
Out[12]:
index percent stars
0 0 52.331885 26257
1 1 47.668115 23917
In [13]:
# create the barplot
sns.barplot(x='index', y='percent', data=plot_df_pc, palette=['darkred', 'darkgreen'])

# show the plot
plt.show()

We also took a look at this data in the previous notebbok. The current selection of the data (restaurants + the selected states), and the transformation of the stars variable made the target variable one practically totally balanced with 52.2% of the data as 0 (less than 4 stars) and 47.7% as 1 (successfull businesses).

Missing values¶

In the previous notebook (1. Data selection & variables preparation) we already managed the missing values of the columns coming from attributes. Now we have four columns that have missing data, postal_code, average_income, num_days_open and num_hours_open.

postal_code

In [14]:
sum(restaurants_income['postal_code'].isna())
Out[14]:
23
In [15]:
# Create a new dataframe with only the rows that have NA values in the column postal_code
postal_na = restaurants_income[restaurants_income["postal_code"].isna()]

# Select only the columns longitude and latitude from the new dataframe
postal_na = postal_na[["longitude", "latitude", "postal_code"]]

With the function get_postal_code we are able to obtain the postal code based on the latitude and longitude thanks to the Google Maps API

In [16]:
#Replace NA values in the postal_code column with the result of the get_postal_code function
postal_na['postal_code'] = postal_na[['latitude', 'longitude']].apply(
    lambda x: get_postal_code(x['latitude'], x['longitude']), axis=1)

postal_na['postal_code'].astype(str)
postal_na.head(3)
Out[16]:
longitude latitude postal_code
business_id
siwG4ZM7RjUDO52DI84m3w -82.640292 27.767601 33701
--sXnWH9Xm6_NvIjyuA99w -75.165222 39.952584 19102
UZ7mpi6qcsS0FqsH-cyfnA -90.080654 29.934027 70130

We include this postal codes on the datafreame through a merge via left join

In [17]:
# Merge the two dataframes on the business_id column, using the suffixesparameter to specify different suffixes for the columns
merged = pd.merge(restaurants_income, postal_na, on="business_id", how="left", suffixes=("", "_na"))

# Replace the NA values in the postal_code column with the values from the postal_code_na column
merged["postal_code"] = merged["postal_code"].fillna(merged["postal_code_na"])

# Drop the postal_code_na column
restaurants_income = merged.drop(columns=["postal_code_na","longitude_na","latitude_na"])
In [18]:
restaurants_income[restaurants_income["postal_code"].isna()].head()
Out[18]:
name address city state postal_code latitude longitude stars review_count is_open restaurants food bars sandwiches american (traditional) pizza coffee & tea breakfast & brunch american (new) fast food burgers mexican italian specialty food seafood desserts bakeries salad chinese cafes chicken wings ice cream & frozen yogurt beer sports bars cocktail bars juice bars & smoothies barbeque japanese pubs sushi bars local flavor asian fusion diners steakhouses mediterranean wine bars southern cajun/creole donuts tacos soup thai beer bar tex-mex greek breweries vietnamese chicken shop hot dogs indian bagels comfort food cheesesteaks ethnic food caribbean middle eastern french soul food buffets gastropubs fruits & veggies korean bubble tea tapas/small plates food stands cupcakes noodles wineries cuban spanish acai bowls RestaurantsDelivery OutdoorSeating BusinessAcceptsCreditCards BikeParking RestaurantsPriceRange2 RestaurantsTakeOut ByAppointmentOnly WiFi Alcohol Caters WheelchairAccessible RestaurantsReservations CoatCheck DogsAllowed RestaurantsTableService HasTV HappyHour DriveThru BusinessAcceptsBitcoin Smoking BYOBCorkage AcceptsInsurance RestaurantsCounterService AgesAllowed Premium_Parking Non_Premium_Parking dairy-free gluten-free vegan kosher halal soy-free vegetarian num_days_open num_hours_open average_income
business_id
MxSSESMG5NJCDHy_zsmNCw Fire Monkey Food Truck NaN Saint Petersburg FL None 27.751828 -82.626734 1 6 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 unknown unknown True unknown 2 unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown True True False False False False False False False NaN NaN NaN

Now we only have one NA value for postal code. As we see in the name, the business is a food truck, hence it does not have a fix postal code. For this reason we will drop it from our dataset.

In [19]:
restaurants_income = restaurants_income.drop('MxSSESMG5NJCDHy_zsmNCw', axis=0)
sum(restaurants_income['postal_code'].isna())
Out[19]:
0

average_income missing values

The dataset provided by the USCB has some missing data.

In [20]:
sum(restaurants_income['average_income'].isna())
Out[20]:
538

For these we will replace them with the mean of the column "average_income" of the correspondant state:

In [21]:
# group the dataframe by the state column and calculate the median average_income for each group
median_income_by_state = restaurants_income.groupby('state')['average_income'].median()

# fill the missing values in the average_income column with the median income for the state
restaurants_income['average_income'] = restaurants_income['average_income'].fillna(
    restaurants_income['state'].map(median_income_by_state))
In [22]:
sum(restaurants_income['average_income'].isna())
Out[22]:
0

num_days_open & num_hours_open missing values

For these two variables we will replace the missing value with the median value of each column,

In [23]:
sum(restaurants_income['num_days_open'].isna())
Out[23]:
7394
In [24]:
sum(restaurants_income['num_hours_open'].isna())
Out[24]:
7394
In [25]:
# Calculate the medians of both columns
median_days_open = restaurants_income['num_days_open'].median()
median_hours_open = restaurants_income['num_hours_open'].median()

# Use the fillna method to fill the NA values with the medians
restaurants_income = restaurants_income.fillna({
    'num_days_open': median_days_open,
    'num_hours_open': median_hours_open
})
In [26]:
sum(restaurants_income['num_hours_open'].isna())
Out[26]:
0
In [27]:
sum(restaurants_income['num_days_open'].isna())
Out[27]:
0

Outliers study¶

Looking at the variables there is one possible column with outliers. The average_income. Besides that, the variables we have left are categorical or in a small range of numbers, hence the existence of outliers is nothing but difficult.

average_income

In [28]:
sns.boxplot(x='stars', y='average_income', data=restaurants_income)
Out[28]:
<AxesSubplot: xlabel='stars', ylabel='average_income'>

Despite we observe some outlier values, especially from $150.000 of avergae income, we observe these are balanced between the values of the target variable. Due to the nature of the average_income and the balanced values, we will not take further action.

Variables correlation¶

Spearman

In [29]:
df_spearman = restaurants_income[['name', 'address', 'city', 'state', 'postal_code', 'latitude', 'longitude',
         'is_open', 'restaurants', 'food', 'bars', 'sandwiches', 'american (traditional)',
         'pizza', 'coffee & tea', 'breakfast & brunch', 'american (new)', 'fast food', 'burgers', 
         'mexican', 'italian', 'specialty food', 'seafood', 'desserts', 'bakeries', 'salad', 'chinese',
         'cafes', 'chicken wings', 'ice cream & frozen yogurt', 'beer', 'sports bars', 'cocktail bars',
         'juice bars & smoothies', 'barbeque', 'japanese', 'pubs', 'sushi bars', 'local flavor', 
         'asian fusion', 'diners', 'steakhouses', 'mediterranean', 'wine bars', 'vegetarian', 
         'southern', 'cajun/creole', 'donuts', 'tacos', 'soup', 'thai', 'beer bar', 'tex-mex', 'vegan',
         'greek', 'breweries', 'vietnamese', 'chicken shop', 'hot dogs', 'indian', 'gluten-free', 
         'bagels', 'comfort food', 'cheesesteaks', 'ethnic food', 'caribbean', 'middle eastern', 
         'french', 'soul food', 'buffets', 'gastropubs', 'fruits & veggies', 'korean', 'bubble tea', 
         'tapas/small plates', 'food stands', 'cupcakes', 'noodles', 'wineries', 'cuban', 'halal', 
         'spanish', 'acai bowls', 'RestaurantsDelivery', 'OutdoorSeating', 'BusinessAcceptsCreditCards',
         'BikeParking', 'RestaurantsPriceRange2', 'RestaurantsTakeOut', 'ByAppointmentOnly', 'WiFi',
         'Alcohol', 'Caters', 'WheelchairAccessible', 'RestaurantsReservations', 'CoatCheck',
         'DogsAllowed', 'RestaurantsTableService', 'HasTV', 'HappyHour', 'DriveThru', 
         'BusinessAcceptsBitcoin', 'Smoking', 'BYOBCorkage', 'AcceptsInsurance', 
         'RestaurantsCounterService', 'AgesAllowed', 'Premium_Parking', 'Non_Premium_Parking',
         'review_count', 'num_days_open', 'num_hours_open', 'average_income']]

First, in order to know the correlation between variables, we will plot a heatmat of the spearman correlations between them. For that we will call the funtion _get_corrmatrix from the functions notebook. The function will only call for numerical variables. After, we will do a Cramer's V for all categorical variables (despite having numeric or string values).

In [30]:
get_corr_matrix(dataset = df_spearman, size_figure = [10,8])
In [67]:
spearman = df_spearman.corr(method = 'spearman')
spearman
Out[67]:
latitude longitude is_open restaurants food bars sandwiches american (traditional) pizza coffee & tea breakfast & brunch american (new) fast food burgers mexican italian specialty food seafood desserts bakeries salad chinese cafes chicken wings ice cream & frozen yogurt beer sports bars cocktail bars juice bars & smoothies barbeque japanese pubs sushi bars local flavor asian fusion diners steakhouses mediterranean wine bars vegetarian southern cajun/creole donuts tacos soup thai beer bar tex-mex vegan greek breweries vietnamese chicken shop hot dogs indian gluten-free bagels comfort food cheesesteaks ethnic food caribbean middle eastern french soul food buffets gastropubs fruits & veggies korean bubble tea tapas/small plates food stands cupcakes noodles wineries cuban halal spanish acai bowls RestaurantsPriceRange2 review_count num_days_open num_hours_open average_income
latitude 1.000000 0.325075 -0.016491 0.027388 0.001211 -0.041527 0.008887 -0.007611 0.063944 0.011912 -0.011915 0.005900 -0.028787 -0.010872 -0.026744 0.041775 -0.001523 -0.063391 -0.005809 0.019454 -0.002704 0.039429 -0.020700 -0.033794 -0.000629 -0.003186 -0.027068 -0.022668 -0.017733 -0.010750 0.021157 0.015554 -0.012032 -0.028461 0.014077 0.004034 -0.008005 0.006200 -0.024033 0.003997 -0.036719 -0.066683 0.003028 -0.026553 0.002496 -0.004158 -0.006759 -0.016743 0.002925 -0.033908 0.006665 -0.002366 -0.017584 -0.009597 0.032477 0.002528 0.023600 -0.005653 0.057046 0.001777 -0.027089 0.005478 0.007847 -0.004826 -0.000143 0.019532 0.001137 0.033884 0.004235 -0.012490 0.002722 0.000773 0.009050 -0.001909 -0.074600 -0.004028 -0.027914 -0.019389 0.011775 -0.038601 0.000701 0.006403 0.359561
longitude 0.325075 1.000000 -0.016811 0.033527 -0.006060 -0.030242 0.018106 -0.011439 0.066910 0.015464 -0.014608 -0.003613 -0.072733 -0.044886 -0.081832 0.055422 0.004197 0.001256 -0.009048 0.001645 -0.006184 0.032571 -0.000295 -0.009245 -0.001785 -0.018407 -0.016326 -0.034118 -0.005286 -0.021464 0.022974 0.014063 0.011056 -0.032743 0.019378 0.006938 -0.004152 0.012926 -0.035545 -0.000707 -0.026399 -0.060123 0.025470 -0.038881 0.001267 0.019138 -0.013772 -0.021209 -0.006663 -0.000599 -0.007127 0.004681 0.000043 -0.008254 0.037610 -0.002816 0.024452 -0.001551 0.082388 0.010897 0.032513 0.013511 -0.001033 0.005380 -0.003452 0.015527 0.018506 0.021453 0.012110 -0.008728 -0.001120 0.002329 0.011214 -0.040830 0.016522 -0.003285 0.019574 0.000229 0.022546 -0.043040 0.032250 0.012135 0.215675
is_open -0.016491 -0.016811 1.000000 -0.099063 0.078308 -0.005702 0.027231 -0.003834 0.014837 0.033827 0.013118 -0.062750 0.108176 0.046277 0.011828 -0.026212 0.021357 0.003068 -0.003407 0.001332 0.022772 0.012327 -0.027402 0.043725 -0.010118 0.019261 0.003912 0.029781 0.014090 -0.042038 -0.017849 -0.013137 -0.011691 0.038514 -0.022316 -0.015380 -0.004589 -0.038714 -0.003470 -0.003672 -0.010611 -0.020656 0.037624 0.046675 0.014439 -0.017961 0.029283 0.016692 0.000792 -0.025627 0.021372 -0.025252 0.042816 -0.000276 -0.038791 -0.000546 0.000922 0.004559 0.001948 -0.011572 -0.005214 -0.024092 -0.045955 -0.011852 -0.041484 -0.011695 0.025531 -0.018817 0.010778 -0.016890 -0.003457 0.006251 0.017769 0.008283 -0.009802 0.003041 -0.016603 0.012109 -0.019856 0.147521 -0.041420 0.050457 -0.023151
restaurants 0.027388 0.033527 -0.099063 1.000000 -0.416698 -0.022544 0.206131 0.202098 0.180776 -0.123121 0.176980 0.176369 0.170524 0.159859 0.145903 0.145114 -0.181836 0.132766 -0.101108 -0.079704 0.118613 0.117191 0.115171 0.114753 -0.178104 -0.129852 0.032582 0.020348 -0.060231 0.089521 0.089278 0.000583 0.087808 -0.183017 0.083038 0.080405 0.079753 0.077847 0.007545 0.007275 0.071257 0.070009 -0.101027 0.067985 0.067717 0.066315 -0.012681 0.064088 0.006416 0.061345 -0.079923 0.059449 0.059196 0.058170 0.057755 0.008401 0.008634 0.056279 0.053436 0.006268 0.052872 0.052645 0.050201 0.050022 0.048995 0.048505 -0.110171 0.046494 -0.038124 0.046108 0.045193 -0.096098 0.040173 -0.093867 0.040099 0.002425 0.038742 -0.015759 -0.044820 0.259518 -0.044256 -0.094991 0.001212
food 0.001211 -0.006060 0.078308 -0.416698 1.000000 -0.183220 -0.007051 -0.144637 -0.152022 0.393910 -0.006328 -0.100065 -0.068056 -0.036670 -0.116095 -0.107865 0.307835 -0.096580 0.270057 0.261544 -0.013312 -0.136172 0.087565 -0.107508 0.242164 0.231579 -0.093200 -0.057435 0.197325 -0.050810 -0.083570 -0.046726 -0.081191 -0.035714 -0.055632 -0.065738 -0.087344 -0.049839 -0.005708 -0.005238 -0.049943 -0.052708 0.152506 -0.043367 -0.001855 -0.062465 0.007830 -0.071902 -0.003097 -0.050494 0.131090 -0.028963 -0.056097 0.003794 -0.032220 0.005288 0.125369 -0.015441 -0.030538 0.115854 -0.029956 -0.023907 -0.024862 -0.034143 -0.047029 -0.019500 0.104430 -0.038287 0.101521 -0.028770 0.043011 0.094243 -0.014992 0.087677 -0.009282 -0.003759 -0.023586 0.081987 -0.056956 -0.123659 -0.035687 0.065196 0.030264
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
RestaurantsPriceRange2 0.011775 0.022546 -0.019856 -0.044820 -0.056956 0.174575 -0.149228 0.055691 -0.059063 -0.117828 -0.043575 0.134983 -0.299495 -0.117304 -0.066565 0.059230 0.043334 0.133241 0.001894 -0.012124 0.010729 -0.047298 -0.020085 -0.072414 -0.101493 0.071743 0.045963 0.108846 -0.007134 0.051308 0.082806 0.033375 0.095492 0.053476 0.057409 -0.054332 0.116241 0.026844 0.104762 0.006891 0.046667 0.063355 -0.110134 -0.042771 -0.001802 0.054308 0.037587 -0.048443 0.000173 -0.012945 0.027752 -0.014631 -0.057228 -0.061679 0.040224 0.001915 -0.065194 0.012942 -0.031450 -0.029234 0.014855 0.003793 0.090281 0.020436 0.031339 0.049660 -0.021295 0.022149 -0.013270 0.065606 -0.042844 0.021478 0.019894 0.046463 -0.021933 -0.005680 0.011003 0.025986 1.000000 0.125927 -0.124530 -0.149190 0.090668
review_count -0.038601 -0.043040 0.147521 0.259518 -0.123659 0.180616 0.021593 0.149602 0.016604 -0.014967 0.142104 0.173294 -0.107139 0.023072 0.052943 0.087915 -0.053072 0.127676 0.035722 0.000167 0.095399 0.016357 0.042472 -0.023786 -0.036865 -0.000203 0.054833 0.119406 -0.012987 0.042282 0.091009 0.070931 0.105363 -0.025446 0.081959 0.054794 0.102339 0.054929 0.087012 0.012734 0.077455 0.064167 -0.030412 0.002299 0.055264 0.071608 0.061373 0.025700 0.005973 0.033859 0.048662 0.051966 -0.002811 -0.009589 0.039678 0.013862 0.021912 0.054818 0.012933 0.013523 -0.002015 0.034317 0.058017 0.000185 0.022690 0.077928 -0.036513 0.038349 0.026013 0.061668 -0.028448 -0.005197 0.049837 0.004288 0.015833 0.006506 0.020011 0.000339 0.125927 1.000000 -0.168173 -0.122808 0.053423
num_days_open 0.000701 0.032250 -0.041420 -0.044256 -0.035687 -0.022900 -0.031118 -0.005622 0.047547 0.018008 -0.031066 -0.057206 0.070653 0.034801 0.017350 -0.032967 -0.056696 -0.038864 -0.078001 -0.060153 -0.064256 0.041239 -0.064535 0.041727 0.015450 -0.014619 0.033225 -0.068469 -0.042390 -0.055032 -0.016986 0.016189 -0.015693 -0.006267 -0.035684 0.025539 -0.014630 -0.027978 -0.068212 -0.007758 -0.058011 -0.024884 0.044336 0.004348 -0.032706 -0.038574 -0.036578 0.034141 -0.010016 -0.004536 -0.047782 -0.029459 -0.015652 -0.007527 -0.015011 -0.009248 0.013522 -0.045215 -0.001421 -0.004102 -0.040617 -0.009828 -0.055770 -0.049740 0.025247 -0.014441 -0.012133 -0.020854 -0.011974 -0.044361 -0.025905 -0.052264 -0.033667 -0.037557 -0.031569 -0.004547 -0.029043 -0.027323 -0.124530 -0.168173 1.000000 0.537661 -0.008376
num_hours_open 0.006403 0.012135 0.050457 -0.094991 0.065196 -0.069278 0.005760 -0.021950 0.009873 0.110445 -0.032414 -0.071355 0.100346 0.032577 0.032558 -0.049348 -0.031610 -0.047503 -0.067458 -0.028276 -0.023859 0.059086 -0.060501 0.022808 0.012618 -0.003444 -0.017836 -0.050506 0.007645 -0.045579 -0.040901 -0.017081 -0.044865 -0.048343 -0.027753 0.036319 -0.053329 -0.028734 -0.073837 -0.002338 -0.040655 -0.025070 0.093133 -0.007760 -0.002420 -0.052905 -0.042420 -0.007357 -0.002805 -0.007931 -0.065599 -0.008813 0.031857 -0.019607 -0.047446 -0.002684 0.007164 -0.015508 0.002523 -0.002797 -0.025457 -0.005421 -0.066911 -0.046572 0.007105 -0.019985 0.007983 -0.017693 -0.005085 -0.048812 -0.035907 -0.037912 -0.015266 -0.050083 -0.017790 -0.001741 -0.022737 -0.002260 -0.149190 -0.122808 0.537661 1.000000 -0.000869
average_income 0.359561 0.215675 -0.023151 0.001212 0.030264 -0.021414 0.019847 0.014241 0.039720 0.003601 -0.001124 0.030801 -0.033753 -0.017167 -0.031418 0.068954 0.007227 -0.024857 0.021077 0.025763 0.028924 -0.005995 0.004635 -0.044579 0.039999 0.012880 -0.002241 -0.015522 0.015229 -0.008137 0.031638 0.014027 0.030476 -0.029968 0.019031 -0.006040 0.018106 0.009765 0.009732 0.010919 -0.020076 -0.070676 0.003136 -0.021179 0.015970 0.023980 -0.002817 -0.002368 0.007632 -0.002377 -0.000028 -0.027073 -0.018937 -0.008414 0.017024 0.009383 0.030172 -0.008955 0.018234 -0.019367 -0.059040 -0.017895 0.017151 -0.059997 -0.010188 0.012830 0.010252 0.000171 -0.018902 0.002609 -0.033796 0.022176 -0.001447 0.014389 -0.031961 0.001174 -0.023642 0.023190 0.090668 0.053423 -0.008376 -0.000869 1.000000

83 rows × 83 columns

After checking the results, the only correlating higher than 0.6 in absolute termns is the correlation between the variables vegan and vegetarian. Since all vegan restaurants can also be vonsider vegetarian, as well as, many vegetarian restaurants also provides a huge amount of vegan plates they can get the "vegan" attribute. This variables have a correlation of 0.755891.

Cramer's V

First we define the categorical variables (excluding the target variable)

In [32]:
df_categorical_variables = restaurants_income[['city', 'state', 'postal_code',
         'is_open', 'restaurants', 'food', 'bars', 'sandwiches', 'american (traditional)',
         'pizza', 'coffee & tea', 'breakfast & brunch', 'american (new)', 'fast food', 'burgers', 
         'mexican', 'italian', 'specialty food', 'seafood', 'desserts', 'bakeries', 'salad', 'chinese',
         'cafes', 'chicken wings', 'ice cream & frozen yogurt', 'beer', 'sports bars', 'cocktail bars',
         'juice bars & smoothies', 'barbeque', 'japanese', 'pubs', 'sushi bars', 'local flavor', 
         'asian fusion', 'diners', 'steakhouses', 'mediterranean', 'wine bars', 'vegetarian', 
         'southern', 'cajun/creole', 'donuts', 'tacos', 'soup', 'thai', 'beer bar', 'tex-mex', 'vegan',
         'greek', 'breweries', 'vietnamese', 'chicken shop', 'hot dogs', 'indian', 'gluten-free', 
         'bagels', 'comfort food', 'cheesesteaks', 'ethnic food', 'caribbean', 'middle eastern', 
         'french', 'soul food', 'buffets', 'gastropubs', 'fruits & veggies', 'korean', 'bubble tea', 
         'tapas/small plates', 'food stands', 'cupcakes', 'noodles', 'wineries', 'cuban', 'halal', 
         'spanish', 'acai bowls', 'RestaurantsDelivery', 'OutdoorSeating', 'BusinessAcceptsCreditCards',
         'BikeParking', 'RestaurantsPriceRange2', 'RestaurantsTakeOut', 'ByAppointmentOnly', 'WiFi',
         'Alcohol', 'Caters', 'WheelchairAccessible', 'RestaurantsReservations', 'CoatCheck',
         'DogsAllowed', 'RestaurantsTableService', 'HasTV', 'HappyHour', 'DriveThru', 
         'BusinessAcceptsBitcoin', 'Smoking', 'BYOBCorkage', 'AcceptsInsurance', 
         'RestaurantsCounterService', 'AgesAllowed', 'Premium_Parking', 'Non_Premium_Parking']]

Then, we call from the fucntions auxiliar notebook the _cramersv function, and we plot its heatmap

In [33]:
# Initialize an empty list to store the results of the Cramer's V test
rows = []

# Loop over all pairs of categorical variables in df_categorical_variables
for var1 in df_categorical_variables:
  col = []
  for var2 in df_categorical_variables:
    # Calculate Cramer's V statistic for the current pair of variables
    cramers = cramers_v(df_categorical_variables[var1], df_categorical_variables[var2])
    # Round the Cramer's V statistic to two decimal places and add it to the list col
    col.append(round(cramers, 2))
  # Add the list col to the list rows
  rows.append(col)

# Convert the list of lists to a NumPy array and use it to create a new dataframe df_vcramer
cramers_results = np.array(rows)
df_vcramer = pd.DataFrame(cramers_results, columns=df_categorical_variables.columns, index=df_categorical_variables.columns)

# Plot a heatmap of the p-values in df_vcramer using seaborn
sns.heatmap(df_vcramer, vmin=0, vmax=1, square=True, linewidths=0.5, cmap='coolwarm')
Out[33]:
<AxesSubplot: >
In [34]:
df_vcramer
Out[34]:
city state postal_code is_open restaurants food bars sandwiches american (traditional) pizza coffee & tea breakfast & brunch american (new) fast food burgers mexican italian specialty food seafood desserts bakeries salad chinese cafes chicken wings ice cream & frozen yogurt beer sports bars cocktail bars juice bars & smoothies barbeque japanese pubs sushi bars local flavor asian fusion diners steakhouses mediterranean wine bars vegetarian southern cajun/creole donuts tacos soup thai beer bar tex-mex vegan greek breweries vietnamese chicken shop hot dogs indian gluten-free bagels comfort food cheesesteaks ethnic food caribbean middle eastern french soul food buffets gastropubs fruits & veggies korean bubble tea tapas/small plates food stands cupcakes noodles wineries cuban halal spanish acai bowls RestaurantsDelivery OutdoorSeating BusinessAcceptsCreditCards BikeParking RestaurantsPriceRange2 RestaurantsTakeOut ByAppointmentOnly WiFi Alcohol Caters WheelchairAccessible RestaurantsReservations CoatCheck DogsAllowed RestaurantsTableService HasTV HappyHour DriveThru BusinessAcceptsBitcoin Smoking BYOBCorkage AcceptsInsurance RestaurantsCounterService AgesAllowed Premium_Parking Non_Premium_Parking
city 1.00 0.99 0.65 0.12 0.09 0.08 0.13 0.05 0.09 0.12 0.03 0.05 0.08 0.18 0.13 0.12 0.11 0.04 0.15 0.04 0.04 0.04 0.04 0.02 0.10 0.08 0.00 0.04 0.05 0.00 0.07 0.00 0.05 0.00 0.11 0.00 0.09 0.06 0.08 0.07 0.02 0.11 0.24 0.00 0.07 0.02 0.03 0.00 0.00 0.00 0.09 0.07 0.04 0.00 0.04 0.10 0.00 0.00 0.06 0.12 0.02 0.06 0.05 0.08 0.05 0.00 0.00 0.08 0.09 0.00 0.00 0.10 0.07 0.00 0.17 0.11 0.00 0.03 0.00 0.12 0.16 0.13 0.14 0.10 0.08 0.06 0.07 0.19 0.08 0.10 0.10 0.08 0.13 0.08 0.09 0.10 0.11 0.03 0.06 0.02 0.15 0.00 0.00 0.13 0.11
state 0.99 1.00 0.99 0.05 0.06 0.05 0.06 0.02 0.05 0.09 0.03 0.02 0.03 0.11 0.06 0.12 0.07 0.02 0.10 0.01 0.03 0.02 0.05 0.02 0.05 0.01 0.04 0.04 0.05 0.03 0.05 0.03 0.04 0.03 0.07 0.02 0.02 0.02 0.01 0.05 0.00 0.13 0.23 0.03 0.05 0.02 0.03 0.04 0.04 0.00 0.05 0.03 0.04 0.03 0.03 0.04 0.01 0.03 0.01 0.09 0.00 0.05 0.02 0.05 0.04 0.02 0.02 0.02 0.03 0.01 0.02 0.02 0.00 0.01 0.12 0.10 0.01 0.04 0.03 0.08 0.10 0.09 0.09 0.05 0.05 0.05 0.04 0.17 0.04 0.08 0.05 0.06 0.10 0.04 0.06 0.05 0.07 0.03 0.05 0.07 0.01 0.02 0.02 0.05 0.04
postal_code 0.65 0.99 1.00 0.17 0.10 0.10 0.21 0.05 0.10 0.13 0.03 0.05 0.11 0.20 0.10 0.15 0.12 0.07 0.15 0.06 0.06 0.04 0.09 0.06 0.14 0.03 0.05 0.06 0.12 0.05 0.08 0.02 0.10 0.00 0.15 0.03 0.05 0.07 0.05 0.09 0.04 0.13 0.26 0.03 0.04 0.00 0.07 0.07 0.00 0.00 0.04 0.09 0.10 0.00 0.03 0.11 0.00 0.05 0.00 0.11 0.07 0.11 0.04 0.09 0.12 0.07 0.00 0.09 0.16 0.05 0.06 0.11 0.02 0.00 0.18 0.12 0.00 0.09 0.00 0.15 0.21 0.16 0.19 0.14 0.11 0.07 0.09 0.22 0.11 0.13 0.13 0.10 0.17 0.09 0.12 0.13 0.12 0.08 0.11 0.09 0.00 0.14 0.00 0.24 0.16
is_open 0.12 0.05 0.17 1.00 0.10 0.08 0.00 0.03 0.00 0.01 0.03 0.01 0.06 0.11 0.05 0.01 0.03 0.02 0.00 0.00 0.00 0.02 0.01 0.03 0.04 0.01 0.02 0.00 0.03 0.01 0.04 0.02 0.01 0.01 0.04 0.02 0.01 0.00 0.04 0.00 0.00 0.01 0.02 0.04 0.05 0.01 0.02 0.03 0.02 0.00 0.03 0.02 0.02 0.04 0.00 0.04 0.00 0.00 0.00 0.00 0.01 0.00 0.02 0.05 0.01 0.04 0.01 0.02 0.02 0.01 0.02 0.00 0.00 0.02 0.01 0.01 0.00 0.02 0.01 0.27 0.13 0.10 0.13 0.04 0.06 0.11 0.03 0.13 0.06 0.12 0.14 0.06 0.15 0.08 0.16 0.10 0.17 0.05 0.04 0.04 0.01 0.01 0.02 0.06 0.06
restaurants 0.09 0.06 0.10 0.10 1.00 0.42 0.02 0.21 0.20 0.18 0.12 0.18 0.18 0.17 0.16 0.15 0.14 0.18 0.13 0.10 0.08 0.12 0.12 0.11 0.11 0.18 0.13 0.03 0.02 0.06 0.09 0.09 0.00 0.09 0.18 0.08 0.08 0.08 0.08 0.01 0.00 0.07 0.07 0.10 0.07 0.07 0.07 0.01 0.06 0.00 0.06 0.08 0.06 0.06 0.06 0.06 0.01 0.01 0.06 0.05 0.00 0.05 0.05 0.05 0.05 0.05 0.05 0.11 0.05 0.04 0.05 0.04 0.10 0.04 0.09 0.04 0.00 0.04 0.01 0.53 0.52 0.06 0.03 0.04 0.45 0.11 0.35 0.52 0.30 0.04 0.62 0.03 0.10 0.35 0.56 0.15 0.03 0.05 0.06 0.09 0.05 0.01 0.04 0.03 0.05
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
AcceptsInsurance 0.15 0.01 0.00 0.01 0.05 0.03 0.01 0.01 0.01 0.00 0.00 0.01 0.01 0.00 0.01 0.01 0.01 0.05 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.04 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.02 0.03 0.00 0.01 0.01 0.03 0.10 0.01 0.03 0.01 0.02 0.03 0.00 0.02 0.01 0.03 0.01 0.01 0.02 0.00 0.00 1.00 0.00 0.00 0.01 0.01
RestaurantsCounterService 0.00 0.02 0.14 0.01 0.01 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.00 0.00 0.00 0.00 0.02 0.00 0.00 0.03 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.26 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.29 0.00 0.00 0.00 0.00 0.00 0.00 0.15 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.01 0.00 0.00 0.01 0.01 0.01 0.01 0.01 0.02 0.01 0.01 0.01 0.01 0.02 0.00 0.00 0.06 0.00 1.00 0.00 0.01 0.00
AgesAllowed 0.00 0.02 0.00 0.02 0.04 0.04 0.06 0.08 0.01 0.01 0.02 0.10 0.00 0.01 0.01 0.00 0.01 0.01 0.00 0.01 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.02 0.02 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.30 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.03 0.02 0.01 0.02 0.01 0.03 0.01 0.01 0.03 0.02 0.02 0.01 0.07 0.01 0.01 0.02 0.03 0.00 0.05 0.06 0.00 0.00 0.00 1.00 0.02 0.01
Premium_Parking 0.13 0.05 0.24 0.06 0.03 0.05 0.13 0.03 0.06 0.05 0.03 0.06 0.12 0.08 0.02 0.02 0.02 0.03 0.07 0.02 0.02 0.02 0.02 0.03 0.07 0.02 0.04 0.03 0.10 0.01 0.02 0.03 0.03 0.04 0.10 0.04 0.03 0.08 0.02 0.07 0.00 0.03 0.01 0.01 0.03 0.02 0.02 0.02 0.00 0.00 0.02 0.02 0.02 0.04 0.01 0.01 0.01 0.02 0.02 0.01 0.03 0.02 0.01 0.04 0.03 0.02 0.04 0.02 0.01 0.01 0.06 0.02 0.00 0.01 0.01 0.01 0.00 0.01 0.02 0.13 0.23 0.32 0.28 0.14 0.17 0.07 0.24 0.28 0.22 0.13 0.22 0.12 0.15 0.13 0.21 0.16 0.05 0.11 0.10 0.05 0.01 0.01 0.02 1.00 0.71
Non_Premium_Parking 0.11 0.04 0.16 0.06 0.05 0.06 0.09 0.00 0.09 0.05 0.03 0.08 0.09 0.10 0.00 0.01 0.03 0.03 0.04 0.03 0.03 0.03 0.05 0.03 0.07 0.02 0.05 0.04 0.04 0.00 0.03 0.03 0.05 0.04 0.10 0.03 0.05 0.05 0.03 0.04 0.00 0.03 0.02 0.01 0.03 0.02 0.03 0.04 0.00 0.01 0.02 0.05 0.02 0.03 0.00 0.02 0.00 0.02 0.03 0.00 0.03 0.02 0.02 0.03 0.02 0.01 0.04 0.02 0.02 0.00 0.02 0.02 0.00 0.01 0.02 0.01 0.00 0.01 0.01 0.14 0.25 0.33 0.31 0.09 0.18 0.09 0.26 0.27 0.28 0.14 0.21 0.09 0.18 0.14 0.22 0.16 0.06 0.15 0.08 0.07 0.01 0.00 0.01 0.71 1.00

105 rows × 105 columns

The cramers v also ilustrates us the correlation between the vegan and vegetarian variables as we have seen before. Also we can see the correlation between city, state and postal_code as they all represent location.

Not a high correlation but some information a cramers v can yield is the correlation between the aforementioned location variables and the types of restaurant. For example the cajun/creole variable is for restaurants with West Affrican, French and Spanish roots typical from Loissiana.

Since most of the restaurants with this cusine will be in a reduced number of postal codes or cities, they have a high correlation (compared to other type of food variables) with the location variables.

Pearson

Finally we do a spearman correlation between the continous numerical variables, in case we can obtain any additional information than with the pearson correlation. For this, as before, we set a new dataframe with only the continous numerical variables.

In [35]:
df_continous_vairables = restaurants_income[['latitude', 'longitude', 'review_count', 'num_days_open', 'num_hours_open', 'average_income']]
df_continous_vairables.head()
Out[35]:
latitude longitude review_count num_days_open num_hours_open average_income
business_id
MTSW4McQd7CbVtyjqoe9mw 39.955505 -75.155564 80 7.0 94.0 57735.0
mWMc6_wTdE0EUBKIGXDVfA 40.338183 -75.471659 13 5.0 40.0 71196.0
CF33F8-E6oudUQ46HnavjQ 36.269593 -87.058943 6 6.0 66.0 54097.0
k0hlBqXX-Bt0vf1op7Jr1w 38.565165 -90.321087 19 7.0 63.0 58867.0
bBDDEgkFA1Otx9Lfe7BZUQ 36.208102 -86.768170 10 6.0 76.0 42973.0
In [36]:
get_corr_matrix(dataset = df_continous_vairables,
                metodo = 'pearson', size_figure = [10,8])
In [37]:
df_continous_vairables.corr(method = 'pearson')
Out[37]:
latitude longitude review_count num_days_open num_hours_open average_income
latitude 1.000000 0.132024 -0.034543 -0.005041 -0.009474 0.339397
longitude 0.132024 1.000000 -0.038454 0.018553 -0.000583 0.200906
review_count -0.034543 -0.038454 1.000000 -0.058043 -0.081420 -0.013200
num_days_open -0.005041 0.018553 -0.058043 1.000000 0.474899 0.003031
num_hours_open -0.009474 -0.000583 -0.081420 0.474899 1.000000 -0.018921
average_income 0.339397 0.200906 -0.013200 0.003031 -0.018921 1.000000

Ending with the correlation study, the continous variables of the dataset, does not yield any high association. Number of days and hours open have a 0.47 correlation as is natural. Also, a 0.33 and 0.2 correlation between average_income and latidude and longitude respectively, as average_income depends on postal_code, hence its correlation.

Plotting variables vs the target variable¶

We will not plot the name, address, latitude and longitude as every row is unique. Also due to the reduced number of variables city, postal_code, review_count, average_income and num_hours_open will not be included in the for used with the plot_feature function. We will perform visualization on these outside the for.

In [38]:
plot_variables_df = restaurants_income.drop(columns=['name', 'address', 'latitude', 'longitude', 
                                                     'city', 'postal_code', 'review_count', 
                                                    'average_income', 'num_hours_open'])
In [39]:
plot_variables_df.head()
Out[39]:
state stars is_open restaurants food bars sandwiches american (traditional) pizza coffee & tea breakfast & brunch american (new) fast food burgers mexican italian specialty food seafood desserts bakeries salad chinese cafes chicken wings ice cream & frozen yogurt beer sports bars cocktail bars juice bars & smoothies barbeque japanese pubs sushi bars local flavor asian fusion diners steakhouses mediterranean wine bars southern cajun/creole donuts tacos soup thai beer bar tex-mex greek breweries vietnamese chicken shop hot dogs indian bagels comfort food cheesesteaks ethnic food caribbean middle eastern french soul food buffets gastropubs fruits & veggies korean bubble tea tapas/small plates food stands cupcakes noodles wineries cuban spanish acai bowls RestaurantsDelivery OutdoorSeating BusinessAcceptsCreditCards BikeParking RestaurantsPriceRange2 RestaurantsTakeOut ByAppointmentOnly WiFi Alcohol Caters WheelchairAccessible RestaurantsReservations CoatCheck DogsAllowed RestaurantsTableService HasTV HappyHour DriveThru BusinessAcceptsBitcoin Smoking BYOBCorkage AcceptsInsurance RestaurantsCounterService AgesAllowed Premium_Parking Non_Premium_Parking dairy-free gluten-free vegan kosher halal soy-free vegetarian num_days_open
business_id
MTSW4McQd7CbVtyjqoe9mw PA 1 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 False False False True 1 True False free False True unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown False True False False False False False False False 7.0
mWMc6_wTdE0EUBKIGXDVfA PA 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 unknown unknown True True 2 True unknown unknown unknown False True unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown False True False False False False False False False 5.0
CF33F8-E6oudUQ46HnavjQ TN 0 1 1 1 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 True True True False 1 True False False False False True False False False False True False True unknown unknown unknown unknown unknown unknown False False False False False False False False False 6.0
k0hlBqXX-Bt0vf1op7Jr1w MO 0 0 1 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 False True True unknown 1 True unknown free full_bar True unknown False unknown unknown unknown True unknown unknown unknown unknown unknown unknown unknown unknown False True False False False False False False False 7.0
bBDDEgkFA1Otx9Lfe7BZUQ TN 0 1 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 True True True unknown 1 True False False False False True False False False False True False True unknown unknown unknown unknown unknown unknown False False False False False False False False False 6.0
In [40]:
for i in list(plot_variables_df.columns):
    if (plot_variables_df[i].dtype == float) & (i != 'stars'):
        plot_feature(plot_variables_df, col_name = i, isContinuous = True, target = 'stars')
    elif  i != 'stars':
        plot_feature(plot_variables_df, col_name = i, isContinuous = False, target = 'stars')

Despite the amount of information provided, it migh not seem nothing to highlight, since the distributions are more or less balanced. But taken a closer look we obtain a few insights.

On one hand, and maybe as one can imagine, fast food business or those who have drive through have more probabilities of having a rating lower than 4 stars (around 80%) as expected. On the other hand, those business that have the attribute of food stands or bubble tea establishments have around 75-80% of 4 stars or more.

Some exceptions might be that all hahal tagged business have four or more stars, meanwhile all soy-free tagged business are considered as unsucessfull, as well as those that does not have counter service.

Average income vs stars

In [41]:
# Create a figure and two subplots
fig, axs = plt.subplots(1, 2, figsize=(10, 5))

# Use the 'review_count' column with values of the 'stars' column equal to 0
sns.distplot(restaurants_income[restaurants_income['stars'] == 0]['average_income'], 
             bins=20, ax=axs[0])
axs[0].set_title("Distribution for 'stars' = 0")

# Use the 'review_count' column with values of the 'stars' column equal to 1
sns.distplot(restaurants_income[restaurants_income['stars'] == 1]['average_income'],
             bins=20, color='darkorange', ax=axs[1])
axs[1].set_title("Distribution for 'stars' = 1")

# Show the plots
plt.show()

Review count vs stars

In [42]:
# Create a figure and two subplots
fig, axs = plt.subplots(1, 2, figsize=(10, 5))

# Use the 'review_count' column with values of the 'stars' column equal to 0
sns.distplot(restaurants_income[restaurants_income['stars'] == 0]['review_count'].apply(np.log1p), 
             bins=10, ax=axs[0])
axs[0].set_title("Distribution for 'stars' = 0")

# Use the 'review_count' column with values of the 'stars' column equal to 1
sns.distplot(restaurants_income[restaurants_income['stars'] == 1]['review_count'].apply(np.log1p), 
             bins=10, color='darkorange', ax=axs[1])
axs[1].set_title("Distribution for 'stars' = 1")

# Show the plots
plt.show()

Number of hours open vs stars

In [43]:
# Create a figure and two subplots
fig, axs = plt.subplots(1, 2, figsize=(10, 5))

# Use the 'review_count' column with values of the 'stars' column equal to 0
sns.distplot(restaurants_income[restaurants_income['stars'] == 0]['num_hours_open'],
             bins=10, ax=axs[0])
axs[0].set_title("Distribution for 'stars' = 0")

# Use the 'review_count' column with values of the 'stars' column equal to 1
sns.distplot(restaurants_income[restaurants_income['stars'] == 1]['num_hours_open'],
             bins=10, color='darkorange', ax=axs[1])
axs[1].set_title("Distribution for 'stars' = 1")

# Show the plots
plt.show()

For these three variables we have plot separately we do not observe a huge variation between the densities of 1 and 0 values from the target variable. No further comments.

Variables Encoding & Normalization

In order to process our data we will encode our data, giving them numeric values and normalizate the whole dataset in such a way the highest number of each variable is 1 and the lowest 0.

But first, we will drop the following variables: name', 'address', 'latitude', 'longitude' and 'city'. This is due to they are mostly unique values if not unique. Hence, they will not provide useful information to our models.

In [44]:
restaurants_income = restaurants_income.drop(columns=['name', 'address', 'latitude', 'longitude', 
                                                     'city'])
In [45]:
restaurants_income.head()
Out[45]:
state postal_code stars review_count is_open restaurants food bars sandwiches american (traditional) pizza coffee & tea breakfast & brunch american (new) fast food burgers mexican italian specialty food seafood desserts bakeries salad chinese cafes chicken wings ice cream & frozen yogurt beer sports bars cocktail bars juice bars & smoothies barbeque japanese pubs sushi bars local flavor asian fusion diners steakhouses mediterranean wine bars southern cajun/creole donuts tacos soup thai beer bar tex-mex greek breweries vietnamese chicken shop hot dogs indian bagels comfort food cheesesteaks ethnic food caribbean middle eastern french soul food buffets gastropubs fruits & veggies korean bubble tea tapas/small plates food stands cupcakes noodles wineries cuban spanish acai bowls RestaurantsDelivery OutdoorSeating BusinessAcceptsCreditCards BikeParking RestaurantsPriceRange2 RestaurantsTakeOut ByAppointmentOnly WiFi Alcohol Caters WheelchairAccessible RestaurantsReservations CoatCheck DogsAllowed RestaurantsTableService HasTV HappyHour DriveThru BusinessAcceptsBitcoin Smoking BYOBCorkage AcceptsInsurance RestaurantsCounterService AgesAllowed Premium_Parking Non_Premium_Parking dairy-free gluten-free vegan kosher halal soy-free vegetarian num_days_open num_hours_open average_income
business_id
MTSW4McQd7CbVtyjqoe9mw PA 19107 1 80 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 False False False True 1 True False free False True unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown False True False False False False False False False 7.0 94.0 57735.0
mWMc6_wTdE0EUBKIGXDVfA PA 18054 1 13 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 unknown unknown True True 2 True unknown unknown unknown False True unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown False True False False False False False False False 5.0 40.0 71196.0
CF33F8-E6oudUQ46HnavjQ TN 37015 0 6 1 1 1 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 True True True False 1 True False False False False True False False False False True False True unknown unknown unknown unknown unknown unknown False False False False False False False False False 6.0 66.0 54097.0
k0hlBqXX-Bt0vf1op7Jr1w MO 63123 0 19 0 1 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 False True True unknown 1 True unknown free full_bar True unknown False unknown unknown unknown True unknown unknown unknown unknown unknown unknown unknown unknown False True False False False False False False False 7.0 63.0 58867.0
bBDDEgkFA1Otx9Lfe7BZUQ TN 37207 0 10 1 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 True True True unknown 1 True False False False False True False False False False True False True unknown unknown unknown unknown unknown unknown False False False False False False False False False 6.0 76.0 42973.0

Encoding¶

The first encoding we will perform is to all attributes variables except for the ones dropped below. These variables have unique values of True, False and Unknown.

In [46]:
selected_columns = restaurants_income.iloc[:, 76:-3].drop(
    columns=['WiFi', 'Alcohol', 'Smoking', 'BYOBCorkage','AgesAllowed', 
             'RestaurantsPriceRange2', 'Premium_Parking','Non_Premium_Parking'])
selected_columns = selected_columns.apply(lambda x: x.astype(str))

We encode True as 1, False as 0 and unknown as 0.5

In [47]:
selected_columns = selected_columns.applymap(
    lambda x: 1 if x == "True" else (0 if x == "False" else 0.5))

The parking variables had unique values of True, False and 0. In this case we will transform the True in 1 and the rest 0.

In [48]:
parking_columns = restaurants_income[['Premium_Parking','Non_Premium_Parking']]
In [49]:
parking_columns = parking_columns.applymap(
    lambda x: 1 if x == "True" else 0)

WiFi, had unique values of Free, Paid, Unknown and False. It seemed reasonable to stablish all the unknowns as 0.5.

Hence, starting from there, False is 0. We are left witn free and paid. Being free the best option we have decided to provide it with a value of 1 and paid with a vaule between free and unknown, 0.75.

In [50]:
restaurants_income['WiFi'] = restaurants_income['WiFi'].apply(
    lambda x: 1 if x == "free" else (0.75 if x == "paid" else (0.5 if x =="unknown" else 0)))

Facing the encoding of Alcohol, Smoking and BYOBCorkage, we faced a similar dilemma than with WiFi. We have decided to mantain the 0 for False, 0.5 for Unknown and give the less restrictive option a 1 (full_bar, yes and yes_free respectively) and 0.75 for the more restrictive option without being False (beer_and_wine, outdoor and yes_corkage).

In [51]:
restaurants_income['Alcohol'] = restaurants_income['Alcohol'].apply(
    lambda x: 1 if x == "full_bar" else (0.75 if x == "beer_and_wine" else (
        0.5 if x =="unknown" else 0)))
In [52]:
restaurants_income['Smoking'] = restaurants_income['Smoking'].apply(
    lambda x: 1 if x == "yes" else (0.75 if x == "beer_and_wine" else (
        0.5 if x =="unknown" else 0)))
In [53]:
restaurants_income['BYOBCorkage'] = restaurants_income['BYOBCorkage'].apply(
    lambda x: 1 if x == "yes_free" else (0.75 if x == "yes_corkage" else (
        0.5 if x =="unknown" else 0)))

In order to encode the AgesAllowed variable, we confront a different situation, here we do not have false nor true. Instead the extreme values are "allages" and "21plus". Since 21plus is the more restrictive we have give it a value of 0 and 1 to allages.

As in the rest of the variables, "unknown" stays at 0.5

The final value to encode will be "18plus". Here our decision has been to give it a 0.25, since its value resembles more to the 21plus rather than all ages.

In [54]:
restaurants_income['AgesAllowed'] = restaurants_income['AgesAllowed'].apply(
    lambda x: 1 if x == "allages" else (0.5 if x == "unknown" else (
        0.25 if x =="18plus" else 0)))

Now we update the values in restaurants_income of the variables stored in selected_columns as well as in parking_columns

In [56]:
restaurants_income = restaurants_income.drop(
    columns = ['RestaurantsDelivery', 'OutdoorSeating', 'BusinessAcceptsCreditCards',
       'BikeParking', 'RestaurantsTakeOut', 'ByAppointmentOnly', 'Caters',
       'WheelchairAccessible', 'RestaurantsReservations', 'CoatCheck',
       'DogsAllowed', 'RestaurantsTableService', 'HasTV', 'HappyHour',
       'DriveThru', 'BusinessAcceptsBitcoin', 'AcceptsInsurance',
       'RestaurantsCounterService', 'dairy-free', 'gluten-free', 'vegan',
       'kosher', 'halal', 'soy-free', 'vegetarian', 'Premium_Parking','Non_Premium_Parking'])
In [57]:
restaurants_income = restaurants_income.join(selected_columns, how="left")
restaurants_income.head()
Out[57]:
state postal_code stars review_count is_open restaurants food bars sandwiches american (traditional) pizza coffee & tea breakfast & brunch american (new) fast food burgers mexican italian specialty food seafood desserts bakeries salad chinese cafes chicken wings ice cream & frozen yogurt beer sports bars cocktail bars juice bars & smoothies barbeque japanese pubs sushi bars local flavor asian fusion diners steakhouses mediterranean wine bars southern cajun/creole donuts tacos soup thai beer bar tex-mex greek breweries vietnamese chicken shop hot dogs indian bagels comfort food cheesesteaks ethnic food caribbean middle eastern french soul food buffets gastropubs fruits & veggies korean bubble tea tapas/small plates food stands cupcakes noodles wineries cuban spanish acai bowls RestaurantsPriceRange2 WiFi Alcohol Smoking BYOBCorkage AgesAllowed num_days_open num_hours_open average_income RestaurantsDelivery OutdoorSeating BusinessAcceptsCreditCards BikeParking RestaurantsTakeOut ByAppointmentOnly Caters WheelchairAccessible RestaurantsReservations CoatCheck DogsAllowed RestaurantsTableService HasTV HappyHour DriveThru BusinessAcceptsBitcoin AcceptsInsurance RestaurantsCounterService dairy-free gluten-free vegan kosher halal soy-free vegetarian
business_id
MTSW4McQd7CbVtyjqoe9mw PA 19107 1 80 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1.0 0.0 0.5 0.5 0.5 7.0 94.0 57735.0 0.0 0.0 0.0 1.0 1.0 0.0 1.0 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0 0 0 0 0 0 0
mWMc6_wTdE0EUBKIGXDVfA PA 18054 1 13 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0.5 0.5 0.5 0.5 0.5 5.0 40.0 71196.0 0.5 0.5 1.0 1.0 1.0 0.5 0.0 1.0 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0 0 0 0 0 0 0
CF33F8-E6oudUQ46HnavjQ TN 37015 0 6 1 1 1 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0.0 0.0 0.5 0.5 0.5 6.0 66.0 54097.0 1.0 1.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.5 0.5 0.5 0 0 0 0 0 0 0
k0hlBqXX-Bt0vf1op7Jr1w MO 63123 0 19 0 1 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1.0 1.0 0.5 0.5 0.5 7.0 63.0 58867.0 0.0 1.0 1.0 0.5 1.0 0.5 1.0 0.5 0.0 0.5 0.5 0.5 1.0 0.5 0.5 0.5 0.5 0.5 0 0 0 0 0 0 0
bBDDEgkFA1Otx9Lfe7BZUQ TN 37207 0 10 1 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0.0 0.0 0.5 0.5 0.5 6.0 76.0 42973.0 1.0 1.0 1.0 0.5 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.5 0.5 0.5 0 0 0 0 0 0 0
In [58]:
restaurants_income = restaurants_income.join(parking_columns, how="left")
restaurants_income.head()
Out[58]:
state postal_code stars review_count is_open restaurants food bars sandwiches american (traditional) pizza coffee & tea breakfast & brunch american (new) fast food burgers mexican italian specialty food seafood desserts bakeries salad chinese cafes chicken wings ice cream & frozen yogurt beer sports bars cocktail bars juice bars & smoothies barbeque japanese pubs sushi bars local flavor asian fusion diners steakhouses mediterranean wine bars southern cajun/creole donuts tacos soup thai beer bar tex-mex greek breweries vietnamese chicken shop hot dogs indian bagels comfort food cheesesteaks ethnic food caribbean middle eastern french soul food buffets gastropubs fruits & veggies korean bubble tea tapas/small plates food stands cupcakes noodles wineries cuban spanish acai bowls RestaurantsPriceRange2 WiFi Alcohol Smoking BYOBCorkage AgesAllowed num_days_open num_hours_open average_income RestaurantsDelivery OutdoorSeating BusinessAcceptsCreditCards BikeParking RestaurantsTakeOut ByAppointmentOnly Caters WheelchairAccessible RestaurantsReservations CoatCheck DogsAllowed RestaurantsTableService HasTV HappyHour DriveThru BusinessAcceptsBitcoin AcceptsInsurance RestaurantsCounterService dairy-free gluten-free vegan kosher halal soy-free vegetarian Premium_Parking Non_Premium_Parking
business_id
MTSW4McQd7CbVtyjqoe9mw PA 19107 1 80 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1.0 0.0 0.5 0.5 0.5 7.0 94.0 57735.0 0.0 0.0 0.0 1.0 1.0 0.0 1.0 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0 0 0 0 0 0 0 0 1
mWMc6_wTdE0EUBKIGXDVfA PA 18054 1 13 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0.5 0.5 0.5 0.5 0.5 5.0 40.0 71196.0 0.5 0.5 1.0 1.0 1.0 0.5 0.0 1.0 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0 0 0 0 0 0 0 0 1
CF33F8-E6oudUQ46HnavjQ TN 37015 0 6 1 1 1 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0.0 0.0 0.5 0.5 0.5 6.0 66.0 54097.0 1.0 1.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.5 0.5 0.5 0 0 0 0 0 0 0 0 0
k0hlBqXX-Bt0vf1op7Jr1w MO 63123 0 19 0 1 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1.0 1.0 0.5 0.5 0.5 7.0 63.0 58867.0 0.0 1.0 1.0 0.5 1.0 0.5 1.0 0.5 0.0 0.5 0.5 0.5 1.0 0.5 0.5 0.5 0.5 0.5 0 0 0 0 0 0 0 0 1
bBDDEgkFA1Otx9Lfe7BZUQ TN 37207 0 10 1 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0.0 0.0 0.5 0.5 0.5 6.0 76.0 42973.0 1.0 1.0 1.0 0.5 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.5 0.5 0.5 0 0 0 0 0 0 0 0 0

Finally we are left with 2 columns that represent location: state and postal_code. Despite postal_code yields more information the number of unique values is more than 80 times more than state. Our intention is create dummy variables and include more than 600 new variables does not seem something reasonal.

Moreover, in the case we drop postal_code, we still would be able to differenciate between different states, as the average_income variable depends on the zip code. For these reason we will drop the variable postal_code and we will create dummies for the state variable.

In [59]:
restaurants_income = restaurants_income.drop(columns='postal_code')
In [60]:
restaurants_income["state"].unique()
Out[60]:
array(['PA', 'TN', 'MO', 'FL', 'AZ', 'LA', 'CA', 'ID'], dtype=object)
In [61]:
dummy_vars = pd.get_dummies(restaurants_income['state'], prefix='state')
dummy_vars.head()
Out[61]:
state_AZ state_CA state_FL state_ID state_LA state_MO state_PA state_TN
business_id
MTSW4McQd7CbVtyjqoe9mw 0 0 0 0 0 0 1 0
mWMc6_wTdE0EUBKIGXDVfA 0 0 0 0 0 0 1 0
CF33F8-E6oudUQ46HnavjQ 0 0 0 0 0 0 0 1
k0hlBqXX-Bt0vf1op7Jr1w 0 0 0 0 0 1 0 0
bBDDEgkFA1Otx9Lfe7BZUQ 0 0 0 0 0 0 0 1

FInally we add them to the rest of the dataset.

In [62]:
restaurants_income = pd.concat([restaurants_income.drop(columns='state'), dummy_vars], axis=1)
restaurants_income.head()
Out[62]:
stars review_count is_open restaurants food bars sandwiches american (traditional) pizza coffee & tea breakfast & brunch american (new) fast food burgers mexican italian specialty food seafood desserts bakeries salad chinese cafes chicken wings ice cream & frozen yogurt beer sports bars cocktail bars juice bars & smoothies barbeque japanese pubs sushi bars local flavor asian fusion diners steakhouses mediterranean wine bars southern cajun/creole donuts tacos soup thai beer bar tex-mex greek breweries vietnamese chicken shop hot dogs indian bagels comfort food cheesesteaks ethnic food caribbean middle eastern french soul food buffets gastropubs fruits & veggies korean bubble tea tapas/small plates food stands cupcakes noodles wineries cuban spanish acai bowls RestaurantsPriceRange2 WiFi Alcohol Smoking BYOBCorkage AgesAllowed num_days_open num_hours_open average_income RestaurantsDelivery OutdoorSeating BusinessAcceptsCreditCards BikeParking RestaurantsTakeOut ByAppointmentOnly Caters WheelchairAccessible RestaurantsReservations CoatCheck DogsAllowed RestaurantsTableService HasTV HappyHour DriveThru BusinessAcceptsBitcoin AcceptsInsurance RestaurantsCounterService dairy-free gluten-free vegan kosher halal soy-free vegetarian Premium_Parking Non_Premium_Parking state_AZ state_CA state_FL state_ID state_LA state_MO state_PA state_TN
business_id
MTSW4McQd7CbVtyjqoe9mw 1 80 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1.0 0.0 0.5 0.5 0.5 7.0 94.0 57735.0 0.0 0.0 0.0 1.0 1.0 0.0 1.0 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0
mWMc6_wTdE0EUBKIGXDVfA 1 13 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0.5 0.5 0.5 0.5 0.5 5.0 40.0 71196.0 0.5 0.5 1.0 1.0 1.0 0.5 0.0 1.0 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0
CF33F8-E6oudUQ46HnavjQ 0 6 1 1 1 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0.0 0.0 0.5 0.5 0.5 6.0 66.0 54097.0 1.0 1.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.5 0.5 0.5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
k0hlBqXX-Bt0vf1op7Jr1w 0 19 0 1 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1.0 1.0 0.5 0.5 0.5 7.0 63.0 58867.0 0.0 1.0 1.0 0.5 1.0 0.5 1.0 0.5 0.0 0.5 0.5 0.5 1.0 0.5 0.5 0.5 0.5 0.5 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0
bBDDEgkFA1Otx9Lfe7BZUQ 0 10 1 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0.0 0.0 0.5 0.5 0.5 6.0 76.0 42973.0 1.0 1.0 1.0 0.5 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.5 0.5 0.5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1

Normalization¶

Since all the data in the dataset left is numeric, in order to normalize the dataset we transform all the columns to type ìnt64

In [63]:
restaurants_income = restaurants_income.apply(lambda x: x.astype('int64'))

In order to scale the variables we will use the MinMaxScaler function form the sklearn library, transforming all the values into a range of 0 and 1

In [64]:
scaler = MinMaxScaler()
restaurants_income_scaled = pd.DataFrame(
    scaler.fit_transform(restaurants_income), columns=restaurants_income.columns)
In [65]:
restaurants_income_scaled.head()
Out[65]:
stars review_count is_open restaurants food bars sandwiches american (traditional) pizza coffee & tea breakfast & brunch american (new) fast food burgers mexican italian specialty food seafood desserts bakeries salad chinese cafes chicken wings ice cream & frozen yogurt beer sports bars cocktail bars juice bars & smoothies barbeque japanese pubs sushi bars local flavor asian fusion diners steakhouses mediterranean wine bars southern cajun/creole donuts tacos soup thai beer bar tex-mex greek breweries vietnamese chicken shop hot dogs indian bagels comfort food cheesesteaks ethnic food caribbean middle eastern french soul food buffets gastropubs fruits & veggies korean bubble tea tapas/small plates food stands cupcakes noodles wineries cuban spanish acai bowls RestaurantsPriceRange2 WiFi Alcohol Smoking BYOBCorkage AgesAllowed num_days_open num_hours_open average_income RestaurantsDelivery OutdoorSeating BusinessAcceptsCreditCards BikeParking RestaurantsTakeOut ByAppointmentOnly Caters WheelchairAccessible RestaurantsReservations CoatCheck DogsAllowed RestaurantsTableService HasTV HappyHour DriveThru BusinessAcceptsBitcoin AcceptsInsurance RestaurantsCounterService dairy-free gluten-free vegan kosher halal soy-free vegetarian Premium_Parking Non_Premium_Parking state_AZ state_CA state_FL state_ID state_LA state_MO state_PA state_TN
0 1.0 0.009917 1.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 1.0 0.0 0.0 0.0 0.0 1.000000 0.559524 0.180427 0.0 0.0 0.0 1.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
1 1.0 0.001058 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.333333 0.0 0.0 0.0 0.0 0.0 0.666667 0.238095 0.237808 0.0 0.0 1.0 1.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
2 0.0 0.000132 1.0 1.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.833333 0.392857 0.164920 1.0 1.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
3 0.0 0.001851 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 1.0 1.0 0.0 0.0 0.0 1.000000 0.375000 0.185253 0.0 1.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
4 0.0 0.000661 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.833333 0.452381 0.117501 1.0 1.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0

Final data exportation¶

Finally, we take our data and export it into a csv under the path "../data/processed/". This will be from where we will import our data in the two final notebooks.

In [66]:
restaurants_income_scaled.to_csv('../data/processed/restaurants_models.csv', index=False)